Structure detection and segmentation of documents using 2D stochastic context-free grammars

Authors

  • Francisco Alvaro
  • Francisco Cruz Fernandez
  • Joan-Andreu Sánchez
  • Oriol Ramos Terrades
  • José-Miguel Benedí
Abstract

In this paper we define a bidimensional extension of Stochastic Context-Free Grammars for structure detection and segmentation of images of documents. Two sets of text classification features are used to perform an initial classification of each zone of the page. Then, the document segmentation is obtained as the most likely hypothesis according to a stochastic grammar. We used a dataset of historical marriage license books to validate this approach. We also tested several inference algorithms for Probabilistic Graphical Models and the results showed that the proposed grammatical model outperformed the other methods. Furthermore, grammars also provide the document structure along with its segmentation.
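The two-stage idea in the abstract (a per-zone classification followed by a search for the most likely grammatical hypothesis) can be illustrated with a small Viterbi-style parser over rectangles of a grid. This is only a minimal sketch under stated assumptions, not the authors' implementation: the class names (Name, Body, Tax), the toy grammar, its rule probabilities, the posterior values, and the helpers `cell`, `parse`, and `leaves` are all hypothetical and chosen for illustration.

```python
import math
from functools import lru_cache

# Hypothetical zone classes produced by the initial per-cell classifier.
CLASSES = ("Name", "Body", "Tax")

# Hypothetical toy 2D grammar. Each rule is (direction, first, second, prob):
# "V" stacks first above second, "H" places first to the left of second.
BINARY_RULES = {
    "Page":   [("V", "Record", "Page", 0.5), ("H", "Name", "Rest", 0.5)],
    "Record": [("H", "Name", "Rest", 1.0)],
    "Rest":   [("H", "Body", "Tax", 1.0)],
}

def parse(posteriors, start="Page"):
    """Return (log-probability, tree) of the most likely derivation.

    posteriors[r][c] maps each class to a strictly positive probability.
    Boxes are half-open: (row_start, row_end, col_start, col_end).
    """
    n_rows, n_cols = len(posteriors), len(posteriors[0])

    @lru_cache(maxsize=None)
    def best(sym, r0, r1, c0, c1):
        if sym in CLASSES:
            # A class covers a rectangle: sum the log-posteriors of its cells.
            score = sum(math.log(posteriors[r][c][sym])
                        for r in range(r0, r1) for c in range(c0, c1))
            return score, (sym, (r0, r1, c0, c1))
        top = (-math.inf, None)
        for direction, a, b, prob in BINARY_RULES[sym]:
            if direction == "V":
                splits = [((r0, k, c0, c1), (k, r1, c0, c1))
                          for k in range(r0 + 1, r1)]
            else:
                splits = [((r0, r1, c0, k), (r0, r1, k, c1))
                          for k in range(c0 + 1, c1)]
            for box_a, box_b in splits:
                score_a, tree_a = best(a, *box_a)
                score_b, tree_b = best(b, *box_b)
                score = math.log(prob) + score_a + score_b
                if score > top[0]:
                    top = (score, (sym, tree_a, tree_b))
        return top

    return best(start, 0, n_rows, 0, n_cols)

def leaves(tree):
    """Flatten a parse tree into its labelled zones, in derivation order."""
    if len(tree) == 2:          # terminal zone: (class, box)
        return [tree]
    return leaves(tree[1]) + leaves(tree[2])

def cell(name, body, tax):
    return {"Name": name, "Body": body, "Tax": tax}

# Made-up classifier output for a 2 x 4 grid of page cells.
N, B, T = cell(0.8, 0.1, 0.1), cell(0.1, 0.8, 0.1), cell(0.1, 0.1, 0.8)
NOISY = cell(0.15, 0.40, 0.45)  # locally looks like Tax
PAGE = [[N, B, B, T],
        [N, N, NOISY, T]]

if __name__ == "__main__":
    score, tree = parse(PAGE)
    for label, box in leaves(tree):
        print(label, box)
```

In this toy input the cell at row 1, column 2 is locally most likely Tax, but the grammar only allows the order Name, Body, Tax within a record, so the most likely parse labels it Body. The returned tree also carries the structure (two stacked records), which mirrors the abstract's point that the grammar yields the document structure along with the segmentation.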

Similar resources

Multimedia and Data Management

Despite the recent advances in handwriting recognition, handwritten two-dimensional (2D) languages are still a challenge. Electrical schemas, chemical equations and mathematical expressions are examples of such 2D languages. In this case, the recognition problem is particularly difficult due to the two-dimensional layout of the language. The main goal of our work is to study the application of t...

Target detection Bridge Modelling using Point Cloud Segmentation Obtained from Photogrameric UAV

In recent years, great efforts have been made to generate 3D models of urban structures in photogrammetry and remote sensing. 3D reconstruction of the bridge, as one of the most important urban structures in transportation systems, has been neglected because of its geometric and structural complexity. Due to the UAV technology development in spatial data acquisition, in this study, the point cl...

Recognition of on-line handwritten mathematical expressions using 2D stochastic context-free grammars and hidden Markov models

This paper describes a formal model for the recognition of on-line handwritten mathematical expressions using 2D stochastic context-free grammars and hidden Markov models. Hidden Markov models are used to recognize mathematical symbols, and a stochastic context-free grammar is used to model the relation between these symbols. This formal model makes it possible to use classic algorithms for parsin...

Studying impressive parameters on the performance of Persian probabilistic context free grammar parser

In linguistics, a tree bank is a parsed text corpus that annotates syntactic or semantic sentence structure. The exploitation of tree bank data has been important ever since the first large-scale tree bank, The Penn Treebank, was published. However, although originating in computational linguistics, the value of tree banks is becoming more widely appreciated in linguistics research as a whole. F...

Introduction to stochastic context free grammars.

Stochastic context free grammars are a formalism which plays a prominent role in RNA secondary structure analysis. This chapter provides the theoretical background on stochastic context free grammars. We recall the general definitions and study the basic properties, virtues, and shortcomings of stochastic context free grammars. We then introduce two ways in which they are used in RNA secondary ...

Journal:
  • Neurocomputing

Volume 150

Publication date: 2015